ABSTRACT


This thesis deals with the problem of measuring system performance in the presence of un- 
certainty. The system under consideration may be as simple as an Army vehicle subjected to 
a kinetic attack or as complex as the human cognitive process. Information about the system 
performance is found in the observed data points, which we call hard information, and may be 
collected from physical sensors, field test data, and computer simulations. Soft information is 
available from human sources such as subject-matter experts and analysts, and represents qual- 
itative information about the system performance and the uncertainty present. We propose the 
use of epi-splines in a nonparametric framework that allows for the systematic integration of 
hard and soft information for the estimation of system performance density functions in order 
to quantify uncertainty. We conduct empirical testing of several benchmark analytical exam- 
ples, where the true probability density functions are known. We compare the performance of 
the epi-spline estimator to kernel-based estimates and highlight a real-world problem context to
illustrate the potential of the framework.
Executive Summary





This thesis deals with the problem of measuring system performance in the presence of uncer- 
tainty. The basic problem construct is that we have stochastic inputs to a system and, therefore, 
the output performance measure(s) is stochastic. We seek a quantitative description of the output
in terms of a probability density, which will quantify the uncertainty about the output in
question. We refer to this process as uncertainty quantification (UQ).


The system under consideration may be as simple as an Army vehicle subjected to a kinetic 
attack or as complex as the human cognitive process. Information about the system performance 
is found in the observed data points, which we call hard information, and may be collected 
from physical sensors, field test data, and computer simulations. Soft information is available 
from human sources such as subject-matter experts and analysts, and represents qualitative
information about the system performance and the uncertainty present.


A framework for systematically incorporating hard and soft information in the estimation of 
probability density functions, regression curves, and in other contexts using epi-splines has 
been developed by Dr. Roger J-B Wets from the University of California, Davis. We propose 
the use of this epi-spline framework for the estimation of density functions representing system 
performance in order to quantify uncertainty. We conduct empirical testing of several bench- 
mark analytical examples where the true probability density functions are known. We compare 
the performance of the epi-spline estimator to kernel-based estimates and highlight a real-world 
problem context to illustrate the potential of the approach. 


The problem of uncertainty quantification is particularly challenging when small data samples 
are available from which to estimate the true underlying probability density. The empirical 
testing in this thesis focuses on small data samples and highlights the use of various sources 
of soft information. Constraint formulations for the soft information are presented and tested 
in several example cases. We find that with as few as five data observations, reasonably good
density estimates can be produced using the epi-spline estimator.


In comparison to traditional kernel-based, nonparametric density estimates, the epi-spline
estimates consistently produce smaller average mean square errors (MSE) over fifty replications.
Epi-spline estimates based on five data points and no soft information have average MSEs
23-48% lower than the corresponding kernel estimates. In many cases, even more significant
improvements are seen in the variability of the MSE statistics of the epi-spline estimates over the
kernel estimates. We see standard deviation reductions of 88-96% over kernel estimates when
no soft information is used. In estimates using two sources of soft information, we see a further
reduction of up to 65% in the average MSE and 56% in the standard deviation beyond that of the
estimates that use no soft information.


Lastly, we use cognitive performance data from a Habitability Assessment Test (HAT) that 
tests the impact of waterborne motion exposure on U.S. Marines under various conditions. The 
unique flexibility of the epi-spline estimator is emphasized to show how issues inherent in
real-world testing situations can be mitigated.
CHAPTER 1:
INTRODUCTION 





Leaders and organizations face many difficult decisions under conditions of uncertainty. Most 
importantly, the uncertainty surrounding the output from a given system poses a significant 
challenge to decision makers. The system can be relatively simple, such as an Army vehicle, or
highly complex, such as an entire battlefield. Regardless of the scale of the system, it is
desirable to be able to estimate the present and future system performance using all available
information.


However, there often exists uncertainty about the information available and how that infor- 
mation is acted upon by the system under consideration. For example, incomplete situational 
awareness about present conditions as well as imprecise forecasts about future events create un- 
certainty. As a result, and along with the complexity of the system itself, there exists uncertainty 
about the output of the system. The ability of the decision maker to have a clear understanding 
of what the output uncertainty looks like, i.e., its (joint) probability distribution, shape
characteristics, moments, quantiles, tail behavior, etc., can have a significant impact on
decisions made and courses of action taken.


As defined by Eldred and Swiler [1], uncertainty quantification (UQ) is "the process of deter- 
mining the effect of input uncertainties on response metrics of interest." More specifically, UQ 
deals with estimating the characteristics of the probability distributions of the output from
complex systems, which have probabilistic data as input. While this can be addressed reasonably
well with traditional statistical methods when large amounts of data are available, decisions are
often required based on limited data.


Uncertainty quantification, a field relatively young in name, is also growing in research interest.
The American Statistical Association (ASA) and Society for Industrial and Applied Mathematics
(SIAM) recently announced a joint effort to create the Journal on Uncertainty Quantification
(JUQ) [2] with the goal of highlighting the interdisciplinary nature of UQ. Berger et al. define
the field in somewhat broader terms:


uncertainty quantification (UQ) in computational science and engineering has to
do with describing the effects of error and uncertainty on results based on simulation
and prediction of the behavior of constructed models of phenomena in physics,
biology, chemistry, ecology, engineered systems, politics, etc. ... Results from
mathematical modeling are subject to errors and uncertainty emanating from a variety of
sources, including uncertainty in data obtained from experiment and observation;
limitations of physical modeling, including uncertain coefficients, approximation,
and the need for emulation; problems in computer codes; and the difficulty of
combining models into integrated systems.


The description provided above and the timely development of a new journal dedicated to UQ
highlight the importance of the research presented in this thesis.


1.1 Motivation and Background 

The task of measuring system performance, and quantifying the uncertainty in that
performance, based on small data samples is a particularly challenging problem. Consider, for
example, a military research and development field test in which the survivability of a
proposed combat vehicle is being measured. Simulations have been conducted, but stakeholders
want to make their decision based on results from real-world field test data. The test will subject
a sample of the vehicles to an improvised explosive device (IED) of a given magnitude and,
based on some quantitative measure, assign each observation a survivability score.


The challenge is that the sample used for the test will be quite small, maybe only five vehicles.
The vehicles are in limited testing production and very expensive. How then do we get an
accurate view of the uncertainty about the survivability from only five hard data observations?
Hard information is defined as specific data points compatible with traditional statistical
estimation techniques. This hard information is often gathered from physical sensors, test data,
and computer simulations.


Because we may not want to make explicit assumptions about the performance measure com- 
ing from a particular family of probability distributions, we enter the realm of nonparametric 
estimation. In this thesis, we focus exclusively on cases where the system performance is de- 
scribed by a probability density function. Hence, we exclude the possibility of, for example, 
integer-valued performance measures. By utilizing nonparametric means, we maintain flexi- 
bility in the design characteristics of the density estimate. This is especially important in the 
context of small data sets because the assumption of a parametric distribution family imposes
many constraints that may not be desirable.


The next question to consider is whether the hard information is all that is known about the 
behavior of this output variable. In many cases there is at least some amount of information 
that is known about the output and how the system in question acts on the input. This soft 
information is defined as that which is derived from human sources such as human intelligence, 
signal intelligence, and the experience of analysts. The soft information is often more qualitative
in nature, coming from a human understanding of the characteristics of the system output.


Engineered systems are often represented using models consisting of differential and algebraic
equations. Let $G(\omega)$ be the solution of these equations, or aggregated quantities derived from
the solution, for a particular choice $\omega$ of parameters in the equations. Since the selection of
parameters is often subject to uncertainty due to incomplete knowledge about material
properties, applied loads, boundary conditions, and environmental factors, the system performance
$\xi = G(\omega)$ is uncertain. For example, in vehicle design [3] the system performance $\xi = G(\omega)$
may give a measure of occupant and structural damage in the case of impact or blast of type $\omega$.
Since the strength and location of the blast as well as the material properties would typically be
unknown, $\omega$ is viewed as a random vector. Consequently, the damage $\xi = G(\omega)$ is also a
random vector. Knowledge about the underlying differential and algebraic equations may provide
information about the range of $G$, the monotonicity of the density of $\xi$, and other factors, which
we include as soft information in the estimation problem.


Army simulations such as OneSAF, COMBAT XXI, AWARS, and JDOLM model various aspects
of complex military operations. Simulations of this kind rely on the specification of numerous
input parameters $\omega$, which, after running the simulation, result in an output $\xi = G(\omega)$ that may
represent performance metrics such as attrition, enemy losses, and supply level. Since there
may be significant uncertainty about the value of the input parameters, the system performance
is uncertain and needs to be quantified. Experienced analysts may provide knowledge about
the nature of the simulations that can be included as soft information in the estimation of the
density of $\xi$.


The cognitive ability of human and autonomous systems under adverse conditions is often
subject to significant uncertainty. Still, analysts need to plan for this uncertainty based on
limited data. For example, recent field experiments with U.S. Marines landing on a beach under
various degrees of stress and fatigue caused by rough seas and other factors show significant
variability in the Marines' cognitive abilities [4].¹ Since the intensity and duration of waterborne
motion during a military operation are unknown a priori, we view them as a random input $\omega$ to a
Marine's cognitive process, represented by the function $G$, which results in a random cognitive
performance $\xi = G(\omega)$.

¹ In fact, the referenced field experiment with U.S. Marines is the subject of the application presented in Chapter 4.


It is critical for planners to know the density of $\xi$, or at least some of its moments, to better
anticipate required unit strength and support. Clearly, $G$ is not known explicitly, but the density of
$\xi$ can be estimated based on field tests of the kind presented in [4]. In this situation, contextual
knowledge may provide important soft information that improves the estimates significantly.


There are many real world contexts and situations where the need to quantify the uncertainty of 
a particular system performance measure will present itself. The examples discussed above are 
merely illustrative of several interesting application areas. We will not examine cases from all 
of these example areas. This thesis presents test cases of several simple univariate functions, a 
more complex engineering example, and a real-world application related to the human cognitive
domain.


A framework for systematically incorporating hard and soft information in the estimation of 
probability density functions, regression curves, performance functions, and other quantities, 
particularly in the context of small data sets, using epi-splines has been developed by Dr. Roger 
J-B Wets from the University of California, Davis. This framework differs significantly from the 
other main approaches to this estimation problem. As mentioned before, when large² sample
sizes are available, many long-standing statistical estimation techniques that take advantage of the
Central Limit Theorem and the Law of Large Numbers can be applied effectively.


In the area of UQ, function approximation through variations of polynomial expansion tech- 
niques seems to be the most prevalent method currently being pursued. This approach develops 
polynomial expansion approximations of the system G and, with these, is then able to estimate 
the measures of interest concerning the output density based on a priori knowledge of the in- 
put variables. Results using this approach have proven quite effective and useful [1]. In fact, 
when $G$ is smooth, this approach has an exponential rate of convergence [5]. However, when
$\omega$ becomes a high-dimensional vector of inputs, it also becomes quite difficult to construct the
polynomial expansion because of the large number of parameters.


In the area of density estimation, there has been seminal work building for decades. The basic
algorithm of nonparametric density estimation was introduced in 1951 [6]. Following that initial
development, several key papers were published over the next two decades which developed the
theoretical foundation of nonparametric density estimation [7]. It is not until 1978, however,
that we see the first practical application of these theoretical methods in a paper concerning risk
factors in coronary artery disease [8].

² The designation of large versus small data sets is somewhat dependent on the context. There is no set rule as
to what constitutes one or the other in all circumstances. For the purposes of this thesis, we consider small to be
fewer than thirty and large to be 100 or greater, which leaves open a mid-range of values.


These developments all contribute to what is now the most prevalent method for nonparametric
density estimation: the kernel estimator. The basic kernel estimator of a density function
is constructed by computing a density function for each observation in the data. The shape
characteristics of the individual function at each observation depend on the selection of the
kernel function. The most common kernel functions used are the Gaussian, Epanechnikov,
triangular, and bi-weight density functions because density functions themselves work quite
well as kernel functions [7]. The densities at each point in the support are then aggregated to
create the single kernel density estimate. Based on the appropriate selection of a kernel function
and weights given to the sub-densities at each observation, the aggregated density will integrate
to one and qualify as a probability density function. It has also been shown that the kernel
estimator is consistent and has good asymptotic properties [7].
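The construction described above can be sketched in a few lines. The following is a minimal illustration with a Gaussian kernel, not the implementation used for the comparisons in this thesis; the `bandwidth` parameter and the five observations are assumptions made for the example.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_density_estimate(sample, bandwidth):
    """Return a function x -> estimated density, built from one kernel per observation.

    `bandwidth` is a hypothetical tuning parameter (not specified in the text)
    that sets the width of each per-observation density.
    """
    n = len(sample)

    def density(x):
        # Aggregate the rescaled kernels centered at each observation; the
        # weight 1 / (n * bandwidth) makes the aggregate integrate to one.
        total = sum(gaussian_kernel((x - obs) / bandwidth) for obs in sample)
        return total / (n * bandwidth)

    return density


# Five made-up observations (illustrative values only).
observations = [1.0, 2.0, 2.5, 4.0, 5.0]
kde = kernel_density_estimate(observations, bandwidth=0.8)
```

Calling `kde(x)` returns the estimated density at `x`. A larger bandwidth smooths the estimate, while a smaller one makes it follow the individual observations more closely.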


1.2 Contributions 

This thesis examines the validity and potential of the epi-spline framework for UQ problems and 
advances the understanding of its application to complex systems using small data sets. The im- 
plementation of this framework and the numerical work done is contributing to the development 
of new model formulations and computational methods. As such, this thesis effort may advance 
the development of estimation toolboxes, which in the future could be used by analysts and
leveraged to support decision makers within the Department of Defense and elsewhere.


1.3 Thesis Organization 

Chapter 2 explains in more detail the methodology used in this research. We explain the choice 
of maximum likelihood estimation for determining the objective function of the estimation 
problem. The epi-spline framework is discussed in greater detail, and the incorporation of
soft information into the estimation problem is explained.


Chapter 3 lays out the formulation and results from the benchmark testing and analysis portion
of the research. First, we work through some analytic examples building upon preliminary work
done on parameterized probability distributions. There has been significant numerical work
testing the epi-spline framework against known parametric probability distributions. We work to
benchmark the performance of the epi-spline framework against two textbook functions where
the output density can be computed analytically using input data from known parameterized
distributions. We test a quadratic case where $\xi = G(\omega) = \omega^2$ and an exponential case where
$\xi = G(\omega) = e^{\omega}$. The intent is to illustrate the numerical implementation of the epi-spline
framework, compare its estimation properties to those of traditional kernel estimates, and provide
some analysis of its consistency and asymptotic characteristics.


Chapter 4 explores a real-world application area for the epi-spline framework. Using field
test data from another NPS thesis effort regarding the impact of waterborne motion exposure on
cognitive performance, we demonstrate the application of the epi-spline framework
to one of the most common, yet most complex, systems known: the human body.


Chapter 5 concludes the thesis by presenting key findings and recommends areas of further
research to the community of interest.





CHAPTER 2: 
METHODOLOGY 





We propose the use of a flexible framework based on epi-splines, defined in Section 2.2, for
consistent approximation of infinite-dimensional optimization problems arising in density
estimation. We seek to estimate the density $h$ of $\xi$ by maximizing the likelihood function of a
given sample, $\xi^1, \xi^2, \ldots, \xi^\nu$, subject to constraints derived from soft information such as
support bounds, density shape, smoothness, moments, convexity of $G$, and gradient information
about $G$.


2.1 Maximum Likelihood Density Estimation 


We use the maximum likelihood function to determine the objective function of the estimation
problem. Suppose that $h$ belongs to a function space $\mathcal{H}$ such as a Polish space. A density
estimate $h^\nu$ of $h$ using a maximum likelihood criterion is given by

$$ h^\nu \in \operatorname*{argmax}_{h} \left\{ E^\nu[\ln h(\xi)] \;\middle|\; h \in S^\nu \subset \mathcal{H} \right\}, \tag{2.1} $$

where $E^\nu[\cdot]$ is the expectation with respect to the empirical distribution $P^\nu$ generated by the
sample $\xi^1, \ldots, \xi^\nu$,

$$ S^\nu = A^\nu \cap \left\{ h \in \mathcal{H} \;\middle|\; h \ge 0 \text{ a.s.},\ \int h(\xi)\, d\xi = 1 \right\}, \tag{2.2} $$

and $A^\nu$ is a constraint set that accounts for soft information. Since we seek a probability density,
we restrict the optimization to nonnegative functions that integrate to one. From the general
formulation presented above, we now propose to parameterize the problem through a specific
construction of the epi-spline framework.


2.2 Exponential Epi-Spline Framework 


In the estimation of density functions, the primary approximation tool will be exponential
epi-splines.³ Given a sample, $\xi^1, \xi^2, \ldots, \xi^\nu$, the exponential epi-spline estimator of $h$ is given by

$$ h^\nu(\xi) = e^{-s^\nu(\xi)}, \tag{2.3} $$

where $s^\nu : \mathbb{R} \to \bar{\mathbb{R}} = \mathbb{R} \cup \{-\infty, \infty\}$ is an epi-spline. One of the main features of epi-splines is that
they are determined by a finite number of parameters. Furthermore, epi-splines are dense in the
spaces of continuously differentiable functions, of Lipschitz continuous functions, and of lower
semi-continuous (lsc) functions [10]. Unlike the construction of standard splines, however,
epi-splines are constructed with a focus on approximation rather than interpolation.
The family of epi-splines of order $p$ is defined as follows [10]:

Given $-\infty \le d_{-1} < d_0 < \cdots < d_N < d_{N+1} \le \infty$, the family of epi-splines of order $p$,
$\text{e-spl}^p([d_{-1}, d_0, d_1, \ldots, d_N, d_{N+1}])$, is the collection of functions $s : \mathbb{R} \to \bar{\mathbb{R}}$ satisfying:

(i) $s(\xi) = \infty$ for $\xi \in (-\infty, d_{-1}]$ and $\xi \in [d_{N+1}, \infty)$,
(ii) $s$ is $p$ times differentiable on $(d_{-1}, d_{N+1})$,
(iii) $\forall k = 1, 2, \ldots, N$, $s^{(p)}(\xi) = \text{constant}$ for $\xi \in (d_{k-1}, d_k)$,
(iv) $s^{(p)}(\xi) = 0$ for $\xi \in (d_{-1}, d_0)$ and $(d_N, d_{N+1})$.

Here, we use the notation $s^{(p)}(\cdot)$ to denote the $p$th derivative of $s$. We note that the tail
segments of the density, $(d_{-1}, d_0)$ and $(d_N, d_{N+1})$, have one less degree of freedom than the other
segments, $(d_{k-1}, d_k)$, $k = 1, \ldots, N$. That is, while $(d_{k-1}, d_k)$ has a constant $p$th derivative, the
tails have a $p$th derivative equal to zero. The reason for this different treatment is that the tail
segments typically do not include data points to support a rational choice of the $p$th derivative.


Optimization over a family of epi-splines using a maximum likelihood criterion produces the
following estimator. Given a sample $\xi^1, \ldots, \xi^\nu$, the exponential epi-spline estimator is
$h^\nu = e^{-s^\nu(\cdot)}$, where

$$
\begin{aligned}
s^\nu \in \operatorname*{argmin}_{s}\ & E^\nu[s(\xi)] \\
\text{s.t. } & s \in \text{e-spl}^p([d_{-1}, d_0, d_1, \ldots, d_N, d_{N+1}]), \\
& \int e^{-s(\xi)}\, d\xi = 1, \\
& s \in S^\nu,
\end{aligned}
\tag{2.4}
$$

and $S^\nu$ represents the set of constraints placed on the problem that are defined by the available
soft information.

³ The term epi comes from the fact that the spline-like functions developed rely on epi-convergence results.
The usual framework for dealing with the convergence of optimization problems is minimization and relies on
epi-convergence, i.e., set convergence of epigraphs. The epigraph of a function $f : X \to \bar{\mathbb{R}}$ consists of the set of
all points in $X \times \mathbb{R}$ that lie on or above the graph of $f$ [9].


Now, for the actual construction of the exponential epi-spline estimator we use the fact that there
exists a one-to-one correspondence between $\text{e-spl}^p([d_{-1}, d_0, d_1, \ldots, d_N, d_{N+1}])$ and $\mathbb{R}^{p+N}$. There
exists a function $c^p : \mathbb{R} \to \mathbb{R}^{p+N}$ such that every $s \in \text{e-spl}^p([d_{-1}, d_0, d_1, \ldots, d_N, d_{N+1}])$ takes the
form

$$ s(\xi) = \langle c^p(\xi), a \rangle \tag{2.5} $$

on $[d_{-1}, d_{N+1}]$ for some unique $a \in \mathbb{R}^{p+N}$, where $c^p$ is a piecewise polynomial of order at
most $p$ [10].

We call $a \in \mathbb{R}^{p+N}$ the epi-spline vector. It contains the parameters we seek to optimize based
on the maximum likelihood criterion for a given estimation problem. The size of the vector
depends on two things: the order $p$ of the epi-spline being used and the number $N$ of
discretizations of the support interval of the density function. This means that the number of
parameters in the problem is not only finite, but also controllable to the extent that the order of
the epi-spline and the number of discretizations can be chosen deliberately.


We work exclusively with epi-splines of order two, $\text{e-spl}^2$, in this thesis work. We have found
this to be a sufficiently rich family of epi-splines for our purposes. To illustrate the relationship
between the epi-spline construction and a familiar density, consider the normal density:

$$ h(\xi) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(\xi - \mu)^2}{2\sigma^2}}. $$

Because of the quadratic terms in the exponent, the epi-spline estimator would require a constant
second derivative greater than zero in the tail segments to fully represent the normal density
function. To achieve that level of fidelity, an epi-spline of order three, $\text{e-spl}^3$, would be required.
However, empirical testing supports that an $\text{e-spl}^2$ can estimate the normal density very well.

Contrast this with the exponential density function:

$$ h(\xi) = \begin{cases} \lambda e^{-\lambda \xi} & \text{if } 0 < \xi < \infty, \\ 0 & \text{elsewhere.} \end{cases} $$

In this case, second-order epi-splines are sufficient for fully representing the density function.
With $\xi$ appearing in the exponent only to the first power, the second derivative of $s$
is zero. As such, the fact that $s^{(2)}(\xi) = 0$ in the tail segments of an $\text{e-spl}^2$ presents no loss of
fidelity in the estimate.


Epi-splines of order two, $\text{e-spl}^2$, used in this work are constructed as follows.
Let $s \in \text{e-spl}^2([d_{-1}, d_0, d_1, \ldots, d_N, d_{N+1}])$, where $d_{-1} = -\infty$, $d_{N+1} = \infty$, and $d_k = d_0 + k\delta$,
$k = 1, 2, \ldots, N$, for some $\delta > 0$. Then, for $\xi \in (d_{k-1}, d_k]$ where $k = 1, 2, \ldots, N+1$,

$$ s(\xi) = s_0 + v_0(\xi - d_0) + \delta \sum_{j=1}^{k-1} \left( \xi - d_j + \frac{\delta}{2} \right) a_j + \frac{1}{2} (\xi - d_{k-1})^2 a_k, \tag{2.6} $$

and for $\xi \in (d_{-1}, d_0)$,

$$ s(\xi) = s_0 - v_0(d_0 - \xi), $$

where $s_0 = s(d_0)$, $v_0 = s'(d_0)$, and $a_j$, for $j = 1, 2, \ldots, N+1$, are the constant values of $s''(\cdot)$
on $(d_{j-1}, d_j)$, with $a_{N+1} = 0$ by condition (iv).

In the $\text{e-spl}^2$ construction shown above in (2.6), the epi-spline vector $a$ is $(s_0, v_0, a_1, \ldots, a_N)$ and
$\delta$ refers to the width of the discretization intervals. In our construction, we let $\delta$ be a constant
width throughout the support of the density estimate. Theoretically, this does not have to be the
case. There may be situations where it is advantageous to use discretizations of differing widths
in particular areas of the support range.
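To make the order-two construction in (2.6) concrete, the following sketch evaluates $s(\xi)$ from an epi-spline vector. It is an illustration only: it assumes a uniform mesh $d_k = d_0 + k\delta$, linear tails on both sides, and a zero second derivative beyond $d_N$. The function name and argument layout are hypothetical.

```python
import math


def epi_spline_order2(xi, s0, v0, a, d0, delta):
    """Evaluate an order-two epi-spline at xi, following the form of (2.6).

    a = [a_1, ..., a_N] holds the piecewise-constant second derivatives on the
    mesh d_k = d0 + k * delta (uniform width assumed). Both tails are linear.
    """
    n_segments = len(a)
    base = s0 + v0 * (xi - d0)
    if xi <= d0:
        return base  # left tail: second derivative is zero
    # Segment index k such that xi lies in (d_{k-1}, d_k]; k = N + 1 is the right tail.
    k = min(max(math.ceil((xi - d0) / delta), 1), n_segments + 1)
    total = base
    for j in range(1, k):
        d_j = d0 + j * delta
        # Contribution of the fully traversed segment j.
        total += delta * a[j - 1] * (xi - d_j + delta / 2.0)
    if k <= n_segments:
        d_prev = d0 + (k - 1) * delta
        # Curvature of the segment that contains xi.
        total += 0.5 * a[k - 1] * (xi - d_prev) ** 2
    return total


def exponential_epi_spline_density(xi, s0, v0, a, d0, delta):
    """Unnormalized density exp(-s(xi)) associated with the epi-spline."""
    return math.exp(-epi_spline_order2(xi, s0, v0, a, d0, delta))
```

With `a = [2.0, 2.0, 2.0]`, `d0 = 0.0`, `delta = 1.0`, and `s0 = v0 = 0.0`, the function reproduces $\xi^2$ on $[0, 3]$ and continues linearly beyond $\xi = 3$. This is a convenient sanity check on the segment bookkeeping.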
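We estimate the epi-spline vector $a$ through the finite-dimensional problem

$$
\begin{aligned}
a^\nu \in \operatorname*{argmin}_{a}\ & \frac{1}{\nu} \sum_{i=1}^{\nu} \langle c^p(\xi^i), a \rangle + \int_{d_{-1}}^{d_{N+1}} e^{-\langle c^p(\xi), a \rangle}\, d\xi \\
\text{s.t. } & a \in A^\nu \subset \mathbb{R}^{p+N},
\end{aligned}
\tag{2.7}
$$

which is equivalent to Equation (2.4) under certain assumptions [10].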
Note that the objective function is strictly convex because it is the sum of a linear function and
a strictly convex function. As a result, if the feasible region $A^\nu$ is convex, then the solution
$a^\nu$ will be unique [10]. This, in turn, produces a unique density estimate based on $a^\nu$. The
advantage of working with a convex optimization problem should not be underestimated in the
practical sense of finding numerical solutions. Also, note that the requirement for the density to
integrate to one now appears in the objective function.
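A short check may help show why moving the integral into the objective of (2.7) takes over the role of the normalization constraint in (2.4). The derivation below is an illustration under stated assumptions, not part of the cited development. It assumes that constant shifts of $s$ are feasible, and then considers a toy case in which $s$ is affine on $[0, \infty)$ with slope $v_0 > 0$ and the sample mean is $\bar{\xi}$.

```latex
% Step 1: for any candidate s, write m = E^\nu[s(\xi)] and I = \int e^{-s(\xi)} d\xi.
% Shifting s by a constant c gives a one-dimensional objective in c.
\begin{align*}
\varphi(c) &= m + c + e^{-c} I, \\
\varphi'(c) &= 1 - e^{-c} I = 0
  \quad\Longrightarrow\quad \int e^{-(s(\xi) + c)}\, d\xi = 1 .
\end{align*}

% Step 2 (assumed toy case): s(\xi) = s_0 + v_0 \xi on [0, \infty), v_0 > 0.
\begin{align*}
\text{objective} &= s_0 + v_0 \bar{\xi} + \int_0^\infty e^{-s_0 - v_0 \xi}\, d\xi
                  = s_0 + v_0 \bar{\xi} + \frac{e^{-s_0}}{v_0}, \\
\frac{\partial}{\partial s_0} &: \quad 1 - \frac{e^{-s_0}}{v_0} = 0, \qquad
\frac{\partial}{\partial v_0} : \quad \bar{\xi} - \frac{e^{-s_0}}{v_0^2} = 0, \\
&\Longrightarrow\quad v_0 = \frac{1}{\bar{\xi}}, \quad e^{-s_0} = v_0, \quad
h(\xi) = e^{-s(\xi)} = \frac{1}{\bar{\xi}}\, e^{-\xi/\bar{\xi}} .
\end{align*}
```

In this toy case the minimizer integrates to one without an explicit constraint, and it coincides with the familiar maximum likelihood estimate of an exponential density with rate $1/\bar{\xi}$.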


The exponential epi-spline estimator then becomes

$$ h^\nu(\xi) = e^{-\langle c^p(\xi), a^\nu \rangle} \tag{2.8} $$

for $\xi \in (d_{-1}, d_{N+1})$ and $h^\nu(\xi) = 0$ elsewhere.


We would like a statistical estimator to be consistent. Let the density $h = e^{-s(\cdot)}$, with
$s \in \text{e-spl}^p([d_{-1}, d_0, d_1, \ldots, d_N, d_{N+1}])$ and epi-spline vector $a$. If $\{a^\nu\}_{\nu=1}^{\infty}$ is a sequence of
optimal epi-spline vectors, i.e., those determined by (2.7), for a pairwise independent sample
$\xi^1, \xi^2, \ldots, \xi^\nu$ from $h$, then $a^\nu \to a$ and $h^\nu$ epi-converges to $h$, as $\nu \to \infty$, almost surely [10],
and $h^\nu$ is, therefore, a consistent estimator.


2.3 Incorporating Soft Information 

The systematic incorporation of soft information into the estimation problem is one of the great 
strengths of the epi-spline framework. Conceptually, it allows for nearly any information to be 
accounted for in the estimation of $h$. The only limitation is the ability to construct a
mathematical constraint that implements the nature of the soft information. Several of the
constructions currently being explored and implemented are presented below, although not all of
these are used in this thesis work.


2.3.1 Absolutely Continuous Distributions 

First and foremost is the knowledge that the distribution of the output of interest is in fact
continuous. The estimation of the density $h$ of $\xi$ is only meaningful if a
density exists. For example, if $G$ is strictly monotonic, or alternatively, differentiable with
$P(\nabla G(\omega) = 0) = 0$, and $\omega$ has a probability density, then the distribution of $\xi$ is absolutely
continuous.
2.3.2 Support Bounds

Bounds on the support of $\xi$ can sometimes be derived from soft information about the system
underlying $G$ and from knowledge about the support of $\omega$. For the sake of illustration, suppose
that $G(\omega) = \|x^\omega\|_\infty$, where $x^\omega$ is a solution of the differential equation $\dot{x}(t) = f(x(t), \omega)$,
$t \in (0, 1]$, with $x(0) = x_0(\omega)$, where both the dynamics and the initial conditions are random. Then,
under moderate assumptions, $\xi = G(\omega) \le (1 + \|x_0(\omega)\|)e^{K}$, where $K$ is a constant related to $f$.
Bounds on $x_0(\omega)$ then yield an upper bound on $\xi$ that can be incorporated as soft information
in (2.8). This bound can be strengthened in the case of a linear differential equation. In terms
of construction, the bounds are easily implemented using $d_{-1}$ and $d_{N+1}$.


2.3.3 Unimodality

It is common for a probability density to be unimodal in its shape. This can be implemented
with bounds on the epi-spline vector. Since by definition a $p$th-order epi-spline has a piecewise
constant $p$th derivative, we need only require that particular elements of $a$ are non-negative.
Specifically,

$$ A^\nu = \left\{ a = (a_1, a_2, \ldots, a_{p+N}) \in \mathbb{R}^{p+N} \;\middle|\; a_i \ge 0,\ i = p+1, p+2, \ldots, p+N \right\}. \tag{2.9} $$
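The reasoning behind this constraint can be spelled out for the order-two case used in this thesis. The chain below is a sketch added for illustration. It assumes $p = 2$, so that the last $N$ components of the epi-spline vector are the piecewise-constant values of $s''$, and it uses the differentiability of $s$ from the definition of the epi-spline family.

```latex
\begin{align*}
\text{last } N \text{ components of } a \ge 0
  &\;\Longrightarrow\; s''(\xi) \ge 0 \text{ on every segment, with } s'' = 0 \text{ in the tails} \\
  &\;\Longrightarrow\; s' \text{ is nondecreasing, so } s \text{ is convex} \\
  &\;\Longrightarrow\; h'(\xi) = -s'(\xi)\, e^{-s(\xi)} \text{ changes sign at most once, from } + \text{ to } - \\
  &\;\Longrightarrow\; h = e^{-s} \text{ is nondecreasing and then nonincreasing, i.e., unimodal.}
\end{align*}
```

Because the constraint is a set of simple sign bounds on components of $a$, the feasible region stays convex.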


2.3.4 Moment Bounds 


If $G$ is known to be convex, then $G(E[\omega])$, through Jensen's inequality, provides a lower bound
on $E[\xi]$, which can be used in (2.8). Moreover, if the hard data also includes subgradient
information about $G$, which may be the case when $G(\omega)$ derives from the solution of a boundary
value problem of a mechanical system (see, for example, [11]), then soft information about $\xi$ is
available through cutting plane approximations of $G$. Specifically, let $\nabla G(\omega^l)$, $l = 1, 2, \ldots, \nu$, be
subgradients of $G$ at $\omega^1, \ldots, \omega^\nu$. Then,

$$ \hat{G}(\omega) = \max_{l = 1, \ldots, \nu} \left\{ G(\omega^l) + \langle \nabla G(\omega^l), \omega - \omega^l \rangle \right\} \tag{2.10} $$

is a lower bounding function of $G$.
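The cutting plane construction in (2.10) and the Jensen bound can be sketched for a univariate example. The choice $G(\omega) = \omega^2$ and the three sample inputs are assumptions made for illustration; they are not data from this thesis.

```python
def cutting_plane_lower_bound(points, G, dG):
    """Return G_hat(w) = max_l { G(w_l) + dG(w_l) * (w - w_l) }, as in (2.10).

    Written for a univariate convex G with (sub)gradient dG; `points` are the
    observed inputs at which function values and subgradients are available.
    """
    cuts = [(G(w_l), dG(w_l), w_l) for w_l in points]

    def G_hat(w):
        # Each cut is a tangent line that lies below a convex G everywhere.
        return max(value + slope * (w - w_l) for value, slope, w_l in cuts)

    return G_hat


# Hypothetical convex system function and its derivative (illustrative only).
def G(w):
    return w * w


def dG(w):
    return 2.0 * w


sample_inputs = [-1.0, 0.5, 2.0]
G_hat = cutting_plane_lower_bound(sample_inputs, G, dG)

# Jensen's inequality on the empirical distribution: G(mean input) <= mean output.
mean_input = sum(sample_inputs) / len(sample_inputs)
jensen_lower_bound = G(mean_input)
mean_output = sum(G(w) for w in sample_inputs) / len(sample_inputs)
```

Here `G_hat` agrees with `G` at the sampled inputs and lies below it elsewhere, so expectations of `G_hat` can serve as lower bounds when forming moment constraints.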


Consequently, the $q$th moment $E[\xi^q] \ge E[\hat{G}(\omega)^q]$. Since the right-hand side in this inequality
may be computable or at least can be estimated with high accuracy, we can generate lower
bounds on moments of $\xi$ for use in (2.7). Specifically, we produce constraints of the form:

$$ \int_{d_{-1}}^{d_{N+1}} \xi^q e^{-\langle c^p(\xi), a \rangle}\, d\xi \ge E[\hat{G}(\omega)^q]. \tag{2.11} $$

Similarly, $\hat{G}$ may generate upper bounds on the cumulative distribution function of $\xi$ that can
also be incorporated in (2.7).


2.3.5 Gradient Bounds 

If $G$ is bijective and continuously differentiable, then gradient information about $G$ provides soft
information about the density of $\xi$. For example, in the univariate case where $G$ is a function
from $\mathbb{R}$ to $\mathbb{R}$, the density $h(\xi) = f_\omega(\omega)/|G'(\omega)|$ with $\omega = G^{-1}(\xi)$, where $f_\omega$ is the density of $\omega$. Hence,
bounds on the derivative of $G$ translate into bounds on the density of $\xi$. Similar bounds can be
constructed in higher dimensions with the derivative replaced by the Jacobian determinant.
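As a small worked instance of the univariate change-of-variables relation, the sketch below uses the exponential map $G(\omega) = e^{\omega}$. The standard normal input density is an assumption chosen for illustration; the function names are hypothetical.

```python
import math


def input_density(w):
    """Assumed input density f_omega: standard normal (illustrative choice)."""
    return math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)


def output_density(xi):
    """Density of xi = G(w) = exp(w) via f_omega(w) / G'(w), evaluated at w = ln(xi)."""
    if xi <= 0.0:
        return 0.0  # exp(w) never produces non-positive outputs
    w = math.log(xi)   # w = G^{-1}(xi)
    return input_density(w) / math.exp(w)  # G'(w) = exp(w) > 0


# Derivative bounds turn into density bounds: for w in [0, 1], G'(w) lies in
# [1, e] and f_omega(w) lies in [f_omega(1), f_omega(0)], so for xi in [1, e]
# the output density lies between these two values.
density_upper_bound = input_density(0.0) / 1.0
density_lower_bound = input_density(1.0) / math.e
```

The last two lines show the kind of pointwise bound on $h$ that could be passed to the estimation problem as soft information on the interval $[1, e]$.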


2.3.6 Chebyshev Bounds 

For any $S \subset \Omega$, where $\Omega$ is the probability space of input vectors $\omega$, Chebyshev's inequality gives that $P(S) \inf_{\omega \in S} G(\omega) \le E[\xi]$ and hence provides lower bounds on the expectation of $\xi$. The computation of the left-hand side in this inequality is difficult in general. However, if $G$ is a function from $\mathbb{R} \to \mathbb{R}$ and strictly increasing, then the computation is trivial. Under this assumption, it is also easy to determine the exact values of the cumulative distribution function of $\xi$ at the data points $\xi^1, \dots, \xi^\nu$ as they equate to those of the cumulative distribution function of $\omega$ at $\omega^1, \dots, \omega^\nu$. Information of this kind translates into constraints in (2.8).
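To make the strictly increasing case concrete, the sketch below assumes $G(\omega) = e^\omega$ (non-negative and strictly increasing) with $\omega \sim N(0,1)$ and takes $S = [t, \infty)$, so that the infimum of $G$ over $S$ is simply $e^t$. The choice of $G$, the thresholds, and the sample points are illustrative assumptions.

```python
import math


def normal_cdf(w):
    """Cumulative distribution function of a standard normal input."""
    return 0.5 * (1.0 + math.erf(w / math.sqrt(2.0)))


def chebyshev_lower_bound(t):
    """P(S) * inf over S of G for S = [t, inf) and G(omega) = exp(omega)."""
    return (1.0 - normal_cdf(t)) * math.exp(t)


# E[exp(omega)] for omega ~ N(0, 1); every bound must fall at or below this value.
expected_xi = math.exp(0.5)
bounds = {t: chebyshev_lower_bound(t) for t in (-1.0, 0.0, 1.0)}

# For increasing G, the CDF of xi at xi^l = G(omega^l) equals the CDF of omega at omega^l.
omega_points = [-1.2, 0.0, 0.7]
xi_points = [math.exp(w) for w in omega_points]
cdf_at_omega = [normal_cdf(w) for w in omega_points]
cdf_at_xi = [normal_cdf(math.log(x)) for x in xi_points]  # P(exp(omega) <= xi)

for t, value in bounds.items():
    print(f"t={t:+.1f}  lower bound on E[xi] = {value:.4f}")
```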


2.3.7 Kullback-Leibler Divergence 

Suppose soft information suggests that the density to be estimated is, in some sense, close to a known density function. For example, it could be a qualitative judgement from a subject-matter expert that the density should have a shape similar to some known density. An upper bound on the distance from the known density can be implemented through the use of the Kullback-Leibler divergence measure. The Kullback-Leibler divergence from density $h$ to density $g$ defined on $\mathbb{R}$ is

    $d_{KL}(h \| g) = \int_{-\infty}^{\infty} h(\xi) \log \frac{h(\xi)}{g(\xi)} \, d\xi$.    (2.12)

The soft information $d_{KL}(h \| e^{-s}) \le \kappa$ relative to a known density $h$ is implementable as a linear constraint for a given constant $\kappa$. If $s \in \text{e-spl}^p([d_{-1}, d_0, d_1, \dots, d_N, d_{N+1}])$, then

    $d_{KL}(h \| e^{-s}) = \left\langle \int_{d_{-1}}^{d_{N+1}} c(\xi) h(\xi) \, d\xi, \; a \right\rangle + \int_{d_{-1}}^{d_{N+1}} (\log h(\xi)) h(\xi) \, d\xi$,    (2.13)

where $a$ is the epi-spline vector of $s$.

If, in addition, $h = e^{-s^*}$, $s^* \in \text{e-spl}^p([d_{-1}, d_0, d_1, \dots, d_N, d_{N+1}])$, then

    $d_{KL}(h \| e^{-s}) = \left\langle \int_{d_{-1}}^{d_{N+1}} c(\xi) h(\xi) \, d\xi, \; a - a^* \right\rangle$,    (2.14)

where $a^*$ is the epi-spline vector of $s^*$ [10].
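The linearity in (2.13) can be checked numerically. The sketch below is an illustration under stated assumptions rather than the epi-spline implementation: it replaces the epi-spline basis $c(\xi)$ with a hypothetical quadratic basis $(1, \xi, \xi^2)$ on the interval $[0, 4]$, uses a truncated exponential as the known density $h$, and integrates with a simple midpoint rule.

```python
import math

LO, HI, STEPS = 0.0, 4.0, 4000  # hypothetical support and quadrature resolution


def integrate(func):
    """Midpoint rule on [LO, HI]."""
    width = (HI - LO) / STEPS
    return sum(func(LO + (i + 0.5) * width) for i in range(STEPS)) * width


def basis(xi):
    """Stand-in for c(xi); NOT the epi-spline basis, chosen only for illustration."""
    return (1.0, xi, xi * xi)


NORM = 1.0 - math.exp(-(HI - LO))


def h(xi):
    """Known reference density: exponential truncated to [LO, HI]."""
    return math.exp(-xi) / NORM


def kl_direct(a):
    """Integral of h * log(h / exp(-s)) with s(xi) = <c(xi), a>."""
    def integrand(xi):
        s = sum(ci * ai for ci, ai in zip(basis(xi), a))
        return h(xi) * (math.log(h(xi)) + s)
    return integrate(integrand)


# Both pieces of (2.13) depend only on h, so they are computed once.
entropy_term = integrate(lambda xi: h(xi) * math.log(h(xi)))
moment_vector = [integrate(lambda xi, k=k: basis(xi)[k] * h(xi)) for k in range(3)]


def kl_linear(a):
    """Right-hand side of (2.13): affine in the coefficient vector a."""
    return entropy_term + sum(m * ai for m, ai in zip(moment_vector, a))


a = (0.2, 0.5, 0.1)
a_star = (math.log(NORM), 1.0, 0.0)  # coefficients for which exp(-s) equals h
print(kl_direct(a), kl_linear(a), kl_linear(a_star))
```

Because the moment vector and the entropy term are fixed once $h$ is known, the bound $d_{KL}(h \| e^{-s}) \le \kappa$ reduces to a single linear inequality in $a$, and the divergence vanishes at $a = a^*$, consistent with (2.14).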





CHAPTER 3: 
BENCHMARK TESTING AND ANALYSIS 





The purpose of this chapter is to verify through empirical testing the validity and potential of 
the proposed epi-spline framework for the applications which have been suggested. We want to 
show that there is a reasonably good fit of the epi-spline estimate to the true density by visual 
inspection for randomly generated input data in the given cases. We also want to verify the fit 


of the epi-spline estimate through analysis of mean squared error (MSE) [7]. 


We compare the epi-spline and standard kernel estimates visually and in terms of MSE. As ker- 
nel estimates represent the most commonly used method of nonparametric density estimation, 
we propose this as a reasonable comparison from which to gauge the fit of the epi-spline esti- 
mate. The kernel estimates used throughout the analysis are computed with Matlab’s ksdensity 
function using a Gaussian kernel construction and its default algorithm for the selection of an 


optimal bandwidth. 
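For readers without access to Matlab, the comparison procedure can be sketched in Python as follows. This is not a reimplementation of ksdensity: the bandwidth here is the common normal-reference rule $1.06\,\hat{\sigma}\,n^{-1/5}$, and the MSE is taken as the average squared difference between the estimate and the true density over an evenly spaced grid. Both choices are assumptions made for illustration.

```python
import math
import random
import statistics


def gaussian_kde(data, bandwidth=None):
    """Return a Gaussian kernel density estimate as a callable."""
    n = len(data)
    if bandwidth is None:
        # Normal-reference rule of thumb (an assumption, not Matlab's default algorithm).
        bandwidth = 1.06 * statistics.stdev(data) * n ** (-0.2)
    norm = n * bandwidth * math.sqrt(2.0 * math.pi)

    def estimate(x):
        return sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data) / norm

    return estimate


def mean_squared_error(estimate, true_density, grid):
    """Average squared difference between estimate and truth over the grid."""
    return sum((estimate(x) - true_density(x)) ** 2 for x in grid) / len(grid)


def standard_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


rng = random.Random(0)  # arbitrary seed for reproducibility
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
kde = gaussian_kde(sample)
grid = [-4.0 + 0.05 * i for i in range(161)]
mse = mean_squared_error(kde, standard_normal_pdf, grid)
total_mass = sum(kde(x) for x in grid) * 0.05  # crude check that the estimate is a density
print(f"MSE = {mse:.6f}, approximate mass = {total_mass:.3f}")
```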


In this chapter, we consider several simple analytic cases where the true density of the output is known by way of deliberate construction of the input $\omega$ and the function $G$. We then transition to a slightly more complex example of a structural engineering problem taken from [1]. In this latter case, the true density is not known, but through a known definition of the function $G$ and knowledge of the distributions of the input $\omega$ we are able to estimate with high accuracy an asymptotically true density through large sampling.


Now, Thompson states strongly in the Preface to Nonparametric Function Estimation, Mod- 
eling, and Simulation [12] "that we passed diminishing returns in 1-d NDE (nonparametric 
density estimation) around 1978." Although the numerical examples in this thesis are in fact 
one-dimensional, this is where testing and analysis of the proposed epi-spline framework must 


begin. 


The numerical work presented is done using Matlab version 7.12.0.635, 64-bit on an Apple 
MacBook Pro operating on Mac OS X version 10.7.3 with a 2.66 GHz Intel Core i7 quad
processor and 8 GB 1067 MHz DDR3 memory. We use Matlab’s optimization solver fmincon 
with the following settings: MaxIter = 1000, TolFun = 10~°, TolCon = 107°. 
3.1 Quadratic

Let $\xi = G(\omega) = \omega^2$, where $\omega \sim N(0,1)$. This allows us to compare our results with the true density, a chi-square distribution with one degree of freedom, $\chi^2(1)$; see Figure 3.1. In this case, we know something about how the "system" $G$ acts upon the input data $\omega$ and we have quite a bit of soft information concerning $\xi$. That is, we know $G(\omega) = \omega^2$ will produce only non-negative output. We are able to include this information in the form of support bounds on the output by requiring that the left bound be 0. Additionally, we know that the output will be unimodal; in fact, it is not just unimodal but also decreasing over its support.
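The soft information used in this case can be verified directly. Because $G(\omega) = \omega^2$ is not bijective, each $\xi > 0$ has two preimages $\pm\sqrt{\xi}$, and each contributes $f_\omega(\omega)/|2\omega|$ to the output density. The short sketch below, with an arbitrarily chosen grid, confirms that this construction matches the $\chi^2(1)$ density and that the density is decreasing on the grid.

```python
import math


def normal_pdf(w):
    return math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)


def chi_square_1_pdf(xi):
    """Closed-form chi-square density with one degree of freedom, for xi > 0."""
    return math.exp(-0.5 * xi) / math.sqrt(2.0 * math.pi * xi)


def density_of_square(xi):
    """Sum over the two preimages +/- sqrt(xi) of f(omega) / |G'(omega)|."""
    root = math.sqrt(xi)
    return (normal_pdf(root) + normal_pdf(-root)) / (2.0 * root)


grid = [0.05 * k for k in range(1, 61)]  # 0.05, 0.10, ..., 3.00 (illustrative)
values = [density_of_square(x) for x in grid]
is_decreasing = all(a > b for a, b in zip(values, values[1:]))
print("decreasing on grid:", is_decreasing)
```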


Figure 3.1: $\chi^2(1)$ density.


3.1.1 Low Information Cases 

The first example we present is the estimate of the density of $\xi$ utilizing no soft information. With no soft information, we are limited to estimating the density purely from the data points generated. Figures 3.2 - 3.5 illustrate that the estimated density can take on a wide range of shapes, and this is true for the epi-spline estimate and the kernel estimate. We also see the impact of the sample size. We show estimates based on as many as 100 observations and as few as five. The epi-spline and kernel estimates are much closer in shape when there is a larger sample from which to construct the estimate.

Figure 3.2: Density estimate of $\xi = G(\omega) = \omega^2$ with no soft information and $n = 100$.

Figure 3.3: Density estimate of $\xi = G(\omega) = \omega^2$ with no soft information and $n = 50$.

Figure 3.4: Density estimate of $\xi = G(\omega) = \omega^2$ with no soft information and $n = 25$.

Figure 3.5: Density estimate of $\xi = G(\omega) = \omega^2$ with no soft information and $n = 5$.

Table 3.1 shows the MSEs of the fits depicted in Figures 3.2 - 3.5. Note that the MSE tends to decrease, as we would expect, as the sample size increases. The clear exception in this case is with $n = 50$. This highlights a key point to make early in the presentation of our analysis. The data sets used are generated randomly with Matlab's random number generator. As such, any single data set generated can vary significantly from another until a reasonable level of asymptotic behavior is reached. For the data set sizes we are considering, even at $n = 100$, we are not at the point where consistent asymptotic behaviors will be seen. So what we see in this particular case with $n = 50$ is a sample that does not fit into the expected pattern because it is a single sample set that happens to have some unique characteristics. Later, we examine average MSE of various cases to analyze a more statistically significant result than a single case comparison.


Table 3.1: MSE statistics of $\xi = G(\omega) = \omega^2$ with no soft information.

Sample Size         | 100      | 50       | 25       | 5
Kernel Estimate     | 0.315614 | 0.441801 | 0.344619 | 1.018454
Epi-Spline Estimate | 0.331898 | 0.424908 | 0.405645 | 0.936841

An interesting result from Table 3.1, however, is that while the kernel estimate performs better at $n = 100$ and $n = 25$, the epi-spline estimate performs better at $n = 50$ and, notably, with the least amount of data at $n = 5$. To confirm this result in a more rigorous manner, we generate fifty random data sets of five points from the prescribed function $G$ and estimate densities for each of the fifty replications. We perform this testing with samples of five since we are primarily concerned with performance in small data set situations, and because this is a difficult condition under which to obtain good estimates. The results for the kernel and epi-spline estimates are shown in Table 3.2. The results are encouraging. With an average MSE of almost half that of the kernel estimate and a 96% reduction in standard deviation, the epi-spline estimate performs well with very limited information.
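The two summary figures quoted above follow directly from the entries of Table 3.2. A short arithmetic check, using only the reported values:

```python
# Values reported in Table 3.2 (50 replications, n = 5, no soft information).
kernel_mean, kernel_sd = 1.432893, 2.394339
epi_mean, epi_sd = 0.749545, 0.086183

mean_ratio = epi_mean / kernel_mean      # epi-spline average MSE relative to kernel
sd_reduction = 1.0 - epi_sd / kernel_sd  # fractional reduction in standard deviation

print(f"average MSE ratio: {mean_ratio:.3f}")
print(f"reduction in standard deviation: {sd_reduction:.1%}")
```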


Table 3.2: MSE statistics from 50 replications of $\xi = G(\omega) = \omega^2$ with $n = 5$ and no soft information.

                    | Average  | Standard Deviation
Kernel Estimate     | 1.432893 | 2.394339
Epi-Spline Estimate | 0.749545 | 0.086183

We now explore the addition of various aspects and levels of soft information. First, we show the impact of the various types of available soft information individually and then we show their impact when applied in concert with one another. We use the same randomly generated data sets used in the no soft information case throughout in order to make valid comparisons between them. Figures 3.6 - 3.9 show the impact with the unimodality constraint implemented.

Figure 3.6: Density estimate of $\xi = G(\omega) = \omega^2$ with unimodal constraint and $n = 100$.


Table 3.3: MSE of $\xi = G(\omega) = \omega^2$ with unimodal constraint.

Sample Size         | 100      | 50       | 25       | 5
Kernel Estimate     | 0.315614 | 0.441801 | 0.344619 | 1.018454
Epi-Spline Estimate | 0.350229 | 0.432829 | 0.429347 | 1.024610

The soft information constraint forces a unimodal density function as desired; however, from visual inspection, the fit does not seem to improve all that much. The noise in the estimates created from outlier points is smoothed with the unimodal constraint, but overall it does not improve the fit in relation to the true density. Table 3.3 confirms this result in terms of MSE. Although the epi-spline and kernel estimates are still quite similar at $n = 100$, the epi-spline fit based on MSE has worsened from the no information case.

Figure 3.7: Density estimate of $\xi = G(\omega) = \omega^2$ with unimodal constraint and $n = 50$.

Figure 3.8: Density estimate of $\xi = G(\omega) = \omega^2$ with unimodal constraint and $n = 25$.

Figure 3.9: Density estimate of $\xi = G(\omega) = \omega^2$ with unimodal constraint and $n = 5$.


In stark contrast, the estimates with only a non-negative support bound implemented produce much better fits based on visual inspection. Figures 3.10 - 3.13 illustrate these results.


Table 3.4: MSE of $\xi = G(\omega) = \omega^2$ with non-negative support.

Sample Size         | 100      | 50       | 25       | 5
Kernel Estimate     | 0.016502 | 0.014951 | 0.008257 | 0.032190
Epi-Spline Estimate | 0.018888 | 0.019785 | 0.017624 | 0.157543

Table 3.4 shows the MSEs of the estimates based on non-negative support only. Note that the MSEs for the kernel estimates are different than the previous kernel estimates for these same data sets. Matlab allows for the inclusion of support information in its kernel-based density estimates. In order to make a more accurate and objective comparison of the performance between the two estimation methods, we include the support information available in both.

Figure 3.10: Density estimate of $\xi = G(\omega) = \omega^2$ with non-negative support and $n = 100$.

Figure 3.11: Density estimate of $\xi = G(\omega) = \omega^2$ with non-negative support and $n = 50$.

Figure 3.12: Density estimate of $\xi = G(\omega) = \omega^2$ with non-negative support and $n = 25$.

Figure 3.13: Density estimate of $\xi = G(\omega) = \omega^2$ with non-negative support and $n = 5$.

3.1.2 High Information Case

We now consider the high information case in which all of the soft information considered up to this point is combined in a single estimate. Table 3.5 and Figures 3.14-3.17 show the results when non-negative support is combined with the decreasing function constraint. Note that we do not need to explicitly include the unimodal constraint. Unimodality is implicit in a decreasing function so we can enforce both aspects of the shape with a single constraint. We also note that the kernel estimate MSE remains the same as in the case of non-negative support only. The kernel estimate is only able to incorporate support information, so it does not change with the additional information introduced here.
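One common way for a kernel estimate to respect a non-negative support bound is to estimate the density of $\log \xi$ and transform back, so that no probability mass is placed below zero. The sketch below illustrates that idea; whether it matches the exact procedure used by Matlab's ksdensity in the version cited is an assumption, as is the rule-of-thumb bandwidth.

```python
import math
import random
import statistics


def positive_support_kde(data, bandwidth=None):
    """Gaussian KDE on log(data), mapped back to the original scale."""
    logs = [math.log(d) for d in data]
    n = len(logs)
    if bandwidth is None:
        # Normal-reference rule applied on the log scale (illustrative assumption).
        bandwidth = 1.06 * statistics.stdev(logs) * n ** (-0.2)
    norm = n * bandwidth * math.sqrt(2.0 * math.pi)

    def estimate(xi):
        if xi <= 0.0:
            return 0.0  # the support bound is respected exactly
        y = math.log(xi)
        g = sum(math.exp(-0.5 * ((y - v) / bandwidth) ** 2) for v in logs) / norm
        return g / xi  # Jacobian of the log transformation

    return estimate


rng = random.Random(1)  # arbitrary seed
sample = [rng.gauss(0.0, 1.0) ** 2 for _ in range(25)]  # xi = omega^2
kde = positive_support_kde(sample)
print(kde(-0.5), kde(0.5), kde(2.0))
```

No comparable device is available to the kernel estimate for shape information such as monotonicity, which is why its MSE is unchanged between Tables 3.4 and 3.5.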


Table 3.5: MSE of $\xi = G(\omega) = \omega^2$ with non-negative support and decreasing constraint.

Sample Size | 100 | 50

Figure 3.14: Density estimate of $\xi = G(\omega) = \omega^2$ with non-negative support, decreasing constraint, and $n = 100$.

Figure 3.15: Density estimate of $\xi = G(\omega) = \omega^2$ with non-negative support, decreasing constraint, and $n = 50$.

Figure 3.16: Density estimate of $\xi = G(\omega) = \omega^2$ with non-negative support, decreasing constraint, and $n = 25$.

Figure 3.17: Density estimate of $\xi = G(\omega) = \omega^2$ with non-negative support, decreasing constraint, and $n = 5$.


For a more rigorous analysis of the performance comparison between the kernel estimate and the epi-spline estimate, we again perform fifty replications of the test scenario. Table 3.6 shows that, although we saw higher MSE levels with the epi-spline estimates in the single-sample cases above, the epi-spline estimate performs well in comparison to the kernel estimate. The average MSE of the epi-spline estimate in the case of no soft information was 0.749545. With the addition of basic soft information, we see a 75% reduction in the average MSE with a very similar standard deviation.


Table 3.6: MSE statistics from 50 replications of $\xi = G(\omega) = \omega^2$ with non-negative support, decreasing constraint, and $n = 5$.

                    | Average  | Standard Deviation
Kernel Estimate     | 5.912541 | 20.029658
Epi-Spline Estimate | 0.187737 | 0.071299

The small standard deviations that we see in the epi-spline estimates are of particular note. The fact that the estimates perform this well with only five data points speaks strongly to the estimator's potential for practical application. In addition, the wide-ranging performance of the kernel estimate in the context of small data sets indicates that, at a minimum, great care must be used when applying a kernel-based estimation process to a small data problem situation.


Now we consider a variation of the quadratic case to illustrate the application of soft information related to the Kullback-Leibler (KL) divergence discussed in Chapter 2. The idea behind applying this as soft information is that there is some qualitative knowledge about what the shape of the output density should be like. This may come from a subject-matter expert familiar with the system. By applying bounds on the KL divergence measure we are able to restrict the distance of the estimate from a known density. The tighter the bound on the divergence, the more the estimate will mirror the known reference density. Recall the notation $d_{KL}(h \| e^{-s}) \le \kappa$ to mean that the KL divergence measure between the known reference density $h$ and the epi-spline estimate $e^{-s(\cdot)}$ must be within some stipulated constant $\kappa$.

For this test case, consider the sum of the squares of ten standard normal random variables, i.e., $\xi = G(\omega) = \omega_1^2 + \omega_2^2 + \cdots + \omega_{10}^2$, where $\omega_i \sim N(0,1)$. This construction creates a true density $h(\xi)$ which is chi-square with ten degrees of freedom, $\chi^2(10)$. Figure 3.18 shows the results of the estimates based on five data points and no soft information.

Figure 3.18: Density estimates of $\xi = G(\omega) = \omega_1^2 + \cdots + \omega_{10}^2$ with $n = 5$.

First, we include the basic soft information that is available: non-negative support and unimodal shape. This result is shown in Figure 3.19.

Figure 3.19: Density estimates of $\xi = G(\omega) = \omega_1^2 + \cdots + \omega_{10}^2$ with non-negative support, unimodal, and $n = 5$.
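The construction of this test case is easy to reproduce by simulation. The sketch below assumes the ten inputs are drawn independently, and checks the simulated output against the known mean (10) and variance (20) of a $\chi^2(10)$ distribution. The seed and the number of draws are arbitrary.

```python
import random
import statistics


def sum_of_squares(rng, terms=10):
    """One draw of xi = omega_1^2 + ... + omega_10^2 with independent N(0, 1) inputs."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(terms))


rng = random.Random(42)  # arbitrary seed
draws = [sum_of_squares(rng) for _ in range(20000)]

sample_mean = statistics.mean(draws)
sample_var = statistics.variance(draws)
print(f"mean = {sample_mean:.3f} (theory 10), variance = {sample_var:.3f} (theory 20)")

# A five-point data set of the kind used for the estimates in this case.
small_sample = draws[:5]
```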


We then add a relaxed KL divergence bound of one, shown in Figure 3.20.

Figure 3.20: Density estimates of $\xi = G(\omega) = \omega_1^2 + \cdots + \omega_{10}^2$ with non-negative support, unimodal, $d_{KL} \le 1$, and $n = 5$. (Legend: true density, epi-spline estimate, kernel estimate, empirical density; horizontal axis: $\xi$.)

The addition of the initial KL bound of one brings down the peak of the epi-spline estimate ever so slightly. We then show the impact of gradually decreasing the bound to 0.1 and then to 0.01 in Figures 3.21 and 3.22, respectively. The epi-spline fit improves significantly as the bound decreases, and at the final value of 0.01, the estimate is much closer to the general shape characteristic of the true density.

Figure 3.21: Density estimates of $\xi = G(\omega) = \omega_1^2 + \cdots + \omega_{10}^2$ with non-negative support, unimodal, $d_{KL} \le 0.1$, and $n = 5$.

Figure 3.22: Density estimates of $\xi = G(\omega) = \omega_1^2 + \cdots + \omega_{10}^2$ with non-negative support, unimodal, $d_{KL} \le 0.01$, and $n = 5$.

3.2 Exponential

We now consider a second analytic case where we, again, can compare results to a true known density. In this case $\xi = G(\omega) = e^\omega$, where $\omega \sim N(0,1)$. With this construction, we know that the density $h$ is lognormal with parameters zero and one (the mean and variance of $\log \xi$), LogN(0,1); see Figure 3.23. Again, we have some information about $G$, $\omega$, and $\xi$. We know $\xi = G(\omega) = e^\omega$ will produce only non-negative output, so we can implement a left support bound of 0. We also know that the density is unimodal. Unlike $G(\omega) = \omega^2$, however, we do not have a decreasing function in this case. We will use this example to explore other types of soft information constraints that were discussed in Chapter 2. At this point we have shown the effect of sample size, so we restrict our illustrations to cases with smaller data sets, i.e., $n = 5$ and $n = 25$.

Figure 3.23: LogNormal(0, 1) density.


3.2.1 Low Information Cases 
First, we consider the comparison of the epi-spline estimate including no soft information versus the kernel estimate shown in Figures 3.24 and 3.25. We see clearly that the epi-spline estimate based on five observations is a bit noisy without any additional information.

Both the kernel and epi-spline estimates improve dramatically with 25 observations; however, from visual inspection the epi-spline appears to better capture the density near zero where the true density peaks. This is confirmed by the MSE results in Table 3.7, which summarizes the MSE results for the three low information cases we present.

Figure 3.25: Density estimates of $\xi = G(\omega) = e^\omega$ with no soft information and $n = 5$.

Table 3.7: MSE for low information cases of $\xi = G(\omega) = e^\omega$.

                                            | n = 25   | n = 5
Kernel Estimate                             | 0.066753 | 0.119479
Kernel - non-negative support               | 0.167938 | 0.065961
Epi-Spline - no soft information            | 0.039013 | 0.265073
Epi-Spline - unimodal                       | 0.026743 | 0.114233
Epi-Spline - unimodal, non-negative support | 0.026871 | 0.116977

Unlike the previous example of $G(\omega) = \omega^2$, the MSE for the epi-spline estimate with no soft information for the random data set of size five is worse than that of the kernel estimate. To better determine its true performance, we conduct fifty replications of the estimates on random data sets of size five. Table 3.8 shows that, once again, the epi-spline estimate performs significantly better than the kernel estimate in a very limited data context, particularly in terms of its variability.


Table 3.8: MSE statistics from 50 replications of $\xi = G(\omega) = e^\omega$ estimates with no soft information and $n = 5$.

                    | Average  | Standard Deviation
Kernel Estimate     | 0.526273 | 2.215139
Epi-Spline Estimate | 0.403309 | 0.269902

With the unimodal constraint included in the estimation, the epi-spline estimates for both sample 
sizes shown in Figures 3.26 and 3.27 improve dramatically. And while the estimate with n = 5 
may visually look a bit strange, the MSE of the estimate is less than that of the kernel estimate. 


For relatively little information, the epi-spline estimate with n = 25 is surprisingly accurate. 


In this particular example, the addition of the support bound constraint restricting the density to 
non-negative values does not improve the overall estimate. These fits are shown in Figures 3.28 
and 3.29. The change in MSE from the unimodal constraint only is insignificant and the overall 


shape of the density is not dramatically changed. 


3.2.2 High Information Cases 
For the high information cases, we consider several interesting soft information implementations which have not yet been illustrated in this thesis. We analyze the impact of including gradient information and bounds on the value of the epi-spline at the initial point in the support of the estimate, $d_0$.

Figure 3.26: Density estimates of $\xi = G(\omega) = e^\omega$ with unimodal constraint and $n = 25$.

Figure 3.27: Density estimates of $\xi = G(\omega) = e^\omega$ with unimodal constraint and $n = 5$.

Figure 3.28: Density estimates of $\xi = G(\omega) = e^\omega$ with unimodal constraint, non-negative support, and $n = 25$.

Figure 3.29: Density estimates of $\xi = G(\omega) = e^\omega$ with unimodal constraint, non-negative support, and $n = 5$.

Consider a situation where, although $G$ may be unknown in the sense that we cannot explicitly formulate a mathematical expression that accurately reflects the system in question, gradient information related to particular system output values is known. While this may initially sound like a highly manufactured situation for illustrating this source of soft information, it is a realistic scenario. Extracting gradient information from complex systems such as large scale simulations is an area of research all its own. There continues to be extensive research in this area and there are currently several methods for precisely this need [13].

The motivation behind these techniques for extracting the gradient information is actually somewhat similar to the motivation for this research. In situations where the simulation is costly in terms of time or money, conducting a large number of replications may be impractical. The need to do sensitivity analysis on the system performance motivates the application of these gradient extraction techniques.


Figures 3.30 and 3.31 show the impact of gradient information at one data point. Although the shape of the epi-spline estimate does not change significantly, in the $n = 5$ case the MSE decreases substantially; see Table 3.9.

Figure 3.30: Density estimates of $\xi = G(\omega) = e^\omega$ with unimodal constraint, non-negative support, gradient at 1 point, and $n = 25$.

In the $n = 25$ case, Figure 3.30, the fit is already quite good for the particular sample so the addition of gradient information at one point does not improve the estimate. However, for a comparable percentage of gradient information points from the overall sample size, i.e., one point out of five is 20%, we consider five gradient points in the $n = 25$ case.

Figure 3.31: Density estimates of $\xi = G(\omega) = e^\omega$ with unimodal constraint, non-negative support, gradient at 1 point, and $n = 5$.


Table 3.9: MSE for high information cases of $\xi = G(\omega) = e^\omega$.

                                                   | n = 25   | n = 5
Kernel Estimate                                    | 0.167938 | 0.065961
Epi-Spline - gradient at 1 point                   | 0.031002 | 0.075030
Epi-Spline - gradient at 5 points                  | 0.012840 | 0.023152
Epi-Spline - gradient at 5 points, bounds on $d_0$ | 0.004925 | 0.005560

Note: all epi-spline estimates include unimodal constraint and non-negative support.


Figures 3.32 and 3.33 show the resulting estimates with gradient information at five points. Obviously, in the $n = 5$ case, this constitutes gradient information at all points so the estimate improves dramatically by visual inspection and by MSE. Also, as we would expect, the estimate improves dramatically in the $n = 25$ case. The gradient information is implemented as an equality constraint in the estimation problem, so it clearly has a strong implication in the results of the overall estimate.

Figure 3.32: Density estimates of $\xi = G(\omega) = e^\omega$ with unimodal constraint, non-negative support, gradient at 5 points, and $n = 25$.

Figure 3.33: Density estimates of $\xi = G(\omega) = e^\omega$ with unimodal constraint, non-negative support, gradient at 5 points, and $n = 5$.

The last type of soft information we consider in this test case is a bound on the value of the epi-spline at the initial point in the density support, $d_0$, which is very applicable in this example. We have limited the density to non-negative support, but bounding the value of the epi-spline is a different type of bound. With this we are saying that the density value itself, at the initial point $d_0$, is bounded by some value. In this example, the support range is not only bounded at zero, but the density value at this point is also zero. With this type of soft information, we are able to restrict the density starting point closer to what we know to be the true starting point density. Figures 3.34 and 3.35 show the addition of the $d_0$ bound to the previous level of information.

Figure 3.34: Density estimates of $\xi = G(\omega) = e^\omega$ with unimodal constraint, non-negative support, gradient at 1 point, bound on $d_0$, and $n = 25$.

Both sample size cases show the impact of the $d_0$ bound particularly well. In the $n = 25$ case, the previous estimates of the density value at $d_0$ were in the 0.4 - 0.5 range. We know the true density at $d_0$ is in fact zero. The upper bound that was imposed in this illustration reduces the density value at $d_0$ to around 0.2, much closer to the actual value and significantly improving the visual fit of the estimate. In the $n = 5$ case shown in Figure 3.35, the density value at $d_0$ reduces to almost zero. The MSE results in Table 3.9 confirm the improvement as well: relative to the case with gradient information at five points, the epi-spline MSE falls from 0.012840 to 0.004925 at $n = 25$ and from 0.023152 to 0.005560 at $n = 5$, more than an order of magnitude below the corresponding kernel MSE at both sample sizes.

Figure 3.35: Density estimates of $\xi = G(\omega) = e^\omega$ with unimodal constraint, non-negative support, gradient at 1 point, bound on $d_0$, and $n = 5$.
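The soft information behind the $d_0$ bound can be read off the closed-form LogN(0,1) density. The following sketch evaluates the density on a sequence of points approaching zero and at its mode, which for this distribution lies at $e^{-1}$. The evaluation points are arbitrary.

```python
import math


def lognormal_pdf(xi):
    """LogN(0, 1) density; zero on the non-positive half-line."""
    if xi <= 0.0:
        return 0.0
    return math.exp(-0.5 * math.log(xi) ** 2) / (xi * math.sqrt(2.0 * math.pi))


# Density values at 1e-1, 1e-2, ..., 1e-6: they shrink toward zero at the left bound.
near_zero = [lognormal_pdf(10.0 ** -k) for k in range(1, 7)]

mode = math.exp(-1.0)  # location of the single peak
peak = lognormal_pdf(mode)

print("values approaching d_0 = 0:", [f"{v:.3e}" for v in near_zero])
print(f"mode at {mode:.4f} with density {peak:.4f}")
```

An unconstrained estimate that starts at 0.4 to 0.5 at the left end of the support is therefore far from the truth there, which is why even a loose upper bound at $d_0$ improves the fit.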


We provide MSE statistics from fifty random replications of several high information scenarios for the exponential case with $n = 5$. Although we see a significant improvement in the kernel statistics when the non-negative support bound is included, the epi-spline estimates continue to outperform the kernel estimate in average MSE and variability. The average MSE for the epi-spline with gradient information at one point is negligibly higher than that of the kernel estimate, but given its lower variability the epi-spline estimate would still be favored in this regard.


Table 3.10: MSE statistics from 50 replications of high information cases of $\xi = G(\omega) = e^\omega$ estimates with $n = 5$.

                                                    | Average  | Standard Deviation
Kernel Estimate                                     | 0.542005 | 1.200533
Epi-Spline with gradient at 1 pt.                   | 0.552342 | 0.998670
Epi-Spline with gradient at 2 pts.                  | 0.277791 | 0.625790
Epi-Spline with gradient at 5 pts., bounds on $d_0$ | 0.232691 | 0.388744

Note: all epi-spline estimates include unimodal constraint and non-negative support.

3.3 Column

As a final example case, we borrow a structural engineering problem presented in UQ research done at Sandia National Laboratories [1], [5]. The problem pertains to the analysis of a rectangular column with cross section dimensions $b = 5$ and $h = 15$. We let the material have a yield stress value of $s = 5$. The column is subject to uncertain loads, which bring with them two uncertain inputs to the problem: bending moment and axial force. These sources of uncertainty produce a random vector $\omega$ of length two rather than the single variable cases we have tested thus far.


Let the bending moment $\omega_1$ be normally distributed as $N(2000, 400)$ and the axial force $\omega_2$ be normally distributed as $N(500, 100)$ with a correlation coefficient of 0.5 between them. In this example, the system $G$ is the column's limit-state function such that a negative value indicates failure of the column. Let

    $G(\omega) = 1 - \frac{4\omega_1}{b h^2 s} - \frac{\omega_2^2}{b^2 h^2 s^2}$,    (3.1)

where the random vector $\omega = (\omega_1, \omega_2)$.
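A minimal simulation sketch of the system in (3.1) follows. It assumes that the second argument of $N(\cdot, \cdot)$ above is a standard deviation (the text does not say whether it is a standard deviation or a variance), and it induces the stated correlation by combining two independent standard normal draws. The seed and sample size are arbitrary and much smaller than the one million observations used for the reference density.

```python
import math
import random

B, H, S = 5.0, 15.0, 5.0  # cross section b, h and yield stress s from the text


def limit_state(moment, force):
    """Limit-state function (3.1); negative values indicate failure."""
    return 1.0 - 4.0 * moment / (B * H ** 2 * S) - force ** 2 / (B ** 2 * H ** 2 * S ** 2)


def sample_loads(rng, rho=0.5):
    """Draw (bending moment, axial force) with correlation rho.

    Assumption: 400 and 100 are standard deviations, not variances.
    """
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    return 2000.0 + 400.0 * z1, 500.0 + 100.0 * z2


def correlation(pairs):
    """Sample correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(7)  # arbitrary seed
loads = [sample_loads(rng) for _ in range(20000)]
outputs = [limit_state(m, p) for m, p in loads]

print("sample correlation of inputs:", round(correlation(loads), 3))
print("sample mean of G:", round(sum(outputs) / len(outputs), 3))
```

A small data set of the kind used for the estimates in this section would then simply be the first twenty entries of `outputs`.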


There is no true density with which to make comparisons, but with full information regarding $\omega$ and a mathematical formulation for $G$, we are able to accurately estimate an asymptotically true density by taking a random sample of one million observations with a coefficient of variation of 0.0381. Figure 3.36 shows the probability density of the column mean strength based on the one million random samples.

Figure 3.36: Estimated true density of column example based on 1 million samples.

The purpose of this example is simply to illustrate that the epi-spline estimator can effectively be applied in a more complex system scenario, as a contrast to the simple single variable examples we have shown thus far. As such, we limit this example scenario to a single data sample case for illustration purposes. We generate a random sample of $n = 20$ based on the system construction in (3.1) and explore the application of two possible sources of soft information: unimodality and gradient information. Figure 3.37 shows the epi-spline and kernel estimates with no soft information.

Figure 3.37: Density estimates of the column example with no soft information and $n = 20$.


With the addition of unimodality, shown in Figure 3.38, the epi-spline estimate is already quite close to the asymptotic true density function. Lastly, we suppose gradient information at five of the data points is available. Figure 3.39 shows that the epi-spline estimate based on the twenty data points and two information sources is almost indistinguishable from the true density.

This example reinforces, empirically, that the epi-spline framework may be applied successfully to more complex systems. The key is in identifying the available soft information from a particular problem context and constructing the appropriate problem constraints to enforce the desired behavior in the estimate.

Figure 3.38: Density estimates of the column example with unimodal constraint and $n = 20$.

Figure 3.39: Density estimates of the column example with unimodal constraint, gradient information at 5 points, and $n = 20$.

CHAPTER 4:
HABITABILITY ASSESSMENT TEST ANALYSIS





In this chapter, we apply the epi-spline framework to the analysis of a particular real-world 
problem that is the subject of another NPS thesis [14] and illustrate the potential of the frame- 


work. 


4.1 Test Background and Purpose 

The Habitability Assessment Test (HAT) was conducted to collect data in support of research
studying the effects of waterborne motion on the combat efficiency of individual soldiers. Using
personnel from the United States Marine Corps (USMC), the Amphibious Vehicle Test Branch
(AVTB) conducted the HAT in August 2011 at USMC Base Camp Pendleton, California. The
study defines combat efficiency in terms of "functions affecting an infantryman’s ability to
conduct combat operations during an amphibious assault" [4]. In particular, the study focused on
three primary areas of human function: physical coordination, sensory perception, and cognitive
performance.


Human function is an area with a great deal of inherent uncertainty. There is, arguably, no
system known to man more complex than the human body. The number of variables affecting the
performance of one human relative to another in a given setting is seemingly endless. As such,
it is important to quantify this uncertainty rigorously in research such as the HAT, particularly
because the results and analysis of this study, and many others like it, inform key decisions by
senior leadership within the Armed Forces.


The HAT study is of particular importance to key decisions concerning the design of future
amphibious landing vehicles for the USMC. The structure of the test revolved around three basic
vehicle configurations: Expeditionary Fighting Vehicle (EFV) with cooled air conditioning,
EFV with vent air conditioning, and Amphibious Assault Vehicle (AAV). The test then subjected
four squads of fifteen* to varying durations of waterborne exposure (zero hours as a control
condition, one hour, two hours, and three hours) in those three amphibious vehicle configurations.





*In the test, one of the four squads was sixteen personnel rather than fifteen, but for consistency of sample size,
that squad’s results have been reduced to fifteen observations for our numerical work.

Following the waterborne exposure, the Marines were then circulated through a battery of tests
measuring their performance in the three human function areas previously mentioned.


We focus our analysis on the cognitive performance area of the HAT study in relation to vehicle
configuration and exposure duration. We have chosen to focus on cognitive performance because
results from the HAT study [4] indicate that there was an impact on cognitive throughput
in relation to exposure duration. The results also state that significant impacts on marksmanship
scores, a measure of sensory perception, and on physical coordination performance were not
observed. Specifically, we consider the percentage difference in cognitive throughput† following
the waterborne exposure, which is used as a measure of cognitive performance.


4.2 Application of Epi-Spline Framework 

In our epi-spline framework notation, the random vector @ in this case represents the various 
conditions of the test scenario. For example, @ is comprised of the vehicle configuration, dura- 
tion of waterborne exposure, speed of the vehicle, and sea state conditions during the exposure 
period. The system can be thought of as the test itself, which is measuring human performance. 
In this case, the many human parameters of the individual Marines such as height, weight, 
physical condition, intelligence, etc. become part of the random vector @ and follow some 
probability distribution as we have seen in the Chapter 3 examples. 


We desire an estimate of the density h of ξ, the percentage change in cognitive throughput
from before the waterborne exposure to after the waterborne exposure. Since we are examining
the change in the cognitive measure, a negative value indicates a decrease in cognitive
performance whereas a positive value indicates an increase in performance. Zero indicates no change
and is interpreted as meaning that the waterborne exposure has no impact on cognitive
performance.


The analysis of the data from the three vehicle configurations is a similar process, so we use 
the EFV - cooled air configuration as the primary illustration case. Each of the estimates is 
constructed from the fifteen squad member observations, so n = 15 in all of the estimates. 


Figure 4.1 shows the density estimates for each of the four durations (0, 1, 2, and 3 hours)
under the EFV - cooled air configuration with no soft information. The noisy estimates reflect
how little information is available in the small data samples.

†Cognitive throughput is a measure of the number of questions answered correctly per minute on a cognitive
test used in the HAT.

Figure 4.1: Density estimates of change in cognitive throughput based on waterborne exposure
in EFV - cooled air configuration with no soft information.


We incorporate a unimodal constraint based on information from researchers familiar with the
test. Initially, we hypothesized that the impact of waterborne exposure would be a degradation
in performance across all of the study participants. This would have led to additional soft
information concerning the densities. We anticipated including a monotonic constraint together
with a support bound that would exclude values indicating an improvement in performance as a
result of exposure.


The test results, however, do not support that hypothesis. In human factors testing, there
are often confounding factors that impact results and complicate analysis. In this case, there
are test observations that indicate an improvement in performance following exposure.
Implementing a support bound in this case would be akin to discounting valid observations, which
would not only shrink already small samples but also call into question any conclusions drawn
from the analysis.


The estimates including the unimodal constraint are shown in Figure 4.2. Compared to the no
soft information estimates, these estimates are more in line with a qualitative understanding
of the results. Specifically, consider the similarity between the one and three hour exposure
estimates. Their shape is almost identical and their relationship is what we expect to see. That
is, the three hour estimate is shifted to the left of the one hour estimate, which indicates that
there is a greater decrease in cognitive performance with the increased exposure time.

Figure 4.2: Density estimates of change in cognitive throughput based on waterborne exposure
in EFV - cooled air configuration with a unimodal constraint.


This application presents some interesting possibilities for leveraging the flexibility of
the epi-spline estimation framework in similar study situations. As discussed previously, small
data samples pose unique challenges to fully representing performance. As we saw in Figure 4.2,
the densities, and in particular their relationships to one another, did not always fit our
intuition of how the results should change with increased exposure. For example, Figure 4.2
suggests that performance improves more after one hour of exposure than with no exposure, and
that performance is degraded more significantly after two hours of exposure than after three hours.


These conclusions do not match qualitative information that the relationship between exposure
time and the impact on performance should be essentially monotonic. That is, as exposure time
increases, performance should decrease. But with only fifteen observations, it is very easy to
obtain estimates that contradict such qualitative information. This is where we can leverage the
ability of the epi-spline estimation framework to incorporate soft information to achieve a better
description of system performance.


Consider, then, the incorporation of an upper bound on the first moment, or mean, of the density
estimate. Unlike a support bound, which could effectively discard observations, a bound on the
mean limits where the center of the density’s mass may lie while keeping every observation in
play. With this constraint implemented in a progressive manner in the four exposure time cases,
we can construct estimates that, in some sense, reproduce the qualitative relationship we believe
is present between the true densities.
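
As a sketch of how such a bound might be written, assume the density estimate h is supported on
an interval [l, u] and let b denote the chosen upper bound on the mean. The symbols l, u, and b
are introduced here for illustration only and are not taken from the thesis formulation. Under
those assumptions, the constraint added to the estimation problem would take the form:

```latex
\int_{l}^{u} x \, h(x) \, dx \;\le\; b
```

On a discretized grid the integral becomes a weighted sum of the density values, so the bound
enters the estimation problem as a single additional inequality.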


The "Data sample" column of Table 4.1 shows the sample means calculated directly from the
fifteen data points in each case. The subsequent columns are the means calculated from the
epi-spline estimates with the stated information. The four estimates with the bound on the first
moment implemented are shown in Figure 4.3.


Table 4.1: Mean estimates of USMC HAT data. 





                | Data sample | Epi-Spline, unimodal | Epi-Spline, 1st moment bound added
no exposure     | 0.035687    | 0.035687             | 0.035687 (bound = 0.036)
1 hour exposure | 0.091998    | 0.091997             | 0.035000 (bound = 0.035)
2 hour exposure | -0.027371   | -0.027401            | -0.027371 (bound = 0.035)
3 hour exposure | -0.070614   | -0.070615            | -0.070613 (bound = -0.027)

The sample mean in the control group, no exposure, is 0.035687. We place a relaxed bound
on the first moment of 0.036 and find that the mean of the epi-spline estimate is unchanged,
indicating that the constraint is not active. To implement the moment bounds in a progressive
manner, we then use a slightly more restrictive value than the control group mean, 0.035, as the
upper bound on the 1 hour exposure estimate.


We see from Table 4.1 that, of the four exposure cases, the 1 hour exposure estimate is the
most impacted by the upper bound on the first moment. The 1 hour exposure sample mean is
0.091998 and the mean of the epi-spline estimate with only the unimodal constraint is almost
identical at 0.091997. However, with the first moment bound of 0.035 introduced, the estimate
responds as we expect. Mass of the density is redistributed and the new estimate’s mean is
precisely 0.035, indicating that the constraint is active in the estimation problem. Note that
the location of the mode of the 1 hour exposure estimate does not change from the previous
estimate shown in Figure 4.2, but the height of the density at the mode is reduced and the mass
in the left tail increases.


We continue the same progressive implementation of the bounds on the mean with the 2 and 3
hour exposure cases. That is, we use the calculated mean of a given exposure case as the bound
on the mean of the following case. For example, the mean of the final estimate for the 1 hour
exposure case is 0.035, so this becomes the bound on the mean of the 2 hour exposure case. In both
the 2 and 3 hour exposure cases, the bounds do not restrict the estimate and, thus, the means are
essentially unchanged.
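
The bookkeeping behind the progressive bounds can be sketched in a few lines of Python. This is
only an illustration, not the estimation procedure itself: the function name
`progressive_upper_bounds` is hypothetical, and the sketch assumes that an active bound pulls the
mean exactly onto the bound while an inactive bound leaves the mean unchanged. In the actual
framework the constrained mean comes from re-solving the estimation problem.

```python
def progressive_upper_bounds(unconstrained_means, initial_bounds):
    """Return (bound, constrained_mean) pairs for cases ordered by exposure time.

    unconstrained_means: means of the estimates before any moment bound.
    initial_bounds: bounds chosen by the analyst for the first cases; every
        later case inherits the constrained mean of the case before it.
    """
    results = []
    for i, mean in enumerate(unconstrained_means):
        if i < len(initial_bounds):
            bound = initial_bounds[i]
        else:
            bound = results[-1][1]  # constrained mean of the previous case
        # Assumption: an active bound moves the mean onto the bound,
        # an inactive bound leaves the mean where it was.
        constrained = min(mean, bound)
        results.append((bound, constrained))
    return results


# Means of the unimodal estimates from Table 4.1 (0, 1, 2, and 3 hours).
unimodal_means = [0.035687, 0.091997, -0.027401, -0.070615]
results = progressive_upper_bounds(unimodal_means, initial_bounds=[0.036, 0.035])
for hours, (bound, mean) in enumerate(results):
    print(f"{hours} h: bound = {bound:.6f}, constrained mean = {mean:.6f}")
```

Only the 1 hour case has an active bound under this simplification, which agrees with the pattern
in Table 4.1; the small differences in the last digits of the table come from the numerical
solution of the estimation problem, which this sketch does not reproduce.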
Figure 4.3: Density estimates of change in cognitive throughput based on waterborne exposure
in EFV - cooled air configuration with unimodal constraint and progressive upper bounds on
the 1st moment.


The constraint formulation used here to place bounds on the first moment is easily extendable
to additional moments. For example, limits on the variability of a density can be implemented
through bounds on the second moment.
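
One way such an extension might look, assuming a density estimate h supported on an interval
[l, u] and an analyst-chosen bound b_k on the k-th moment (all of these symbols are illustrative
and not taken from the thesis formulation), is shown below. The second expression records why a
bound on the second moment also limits the variance, since the squared mean is nonnegative.

```latex
\int_{l}^{u} x^{k} \, h(x) \, dx \;\le\; b_k ,
\qquad
\operatorname{Var}_h
  = \int_{l}^{u} x^{2} \, h(x) \, dx - \left( \int_{l}^{u} x \, h(x) \, dx \right)^{2}
  \;\le\; b_2 .
```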
The adjustment to the estimates as a result of the moment information available is significant.
Consider that these probability density estimates may be used for random variate generation, for
example, to produce stochastic inputs to another system such as a computer simulation for
evaluating potential vehicles for the USMC. The epi-spline estimates have the potential to
represent those random inputs more accurately because of their ability to utilize available soft
information.
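
To make the random variate generation step concrete, the following Python sketch draws samples
from a density that is known only through its values on a grid, using inverse-transform
sampling. It is a minimal illustration under stated assumptions: the helper `make_sampler` is
hypothetical, and the triangular density is a stand-in for an estimated density, not an
epi-spline estimate from this thesis.

```python
import bisect
import random


def make_sampler(grid, density):
    """Build an inverse-transform sampler from density values on an increasing grid.

    Cell probabilities are computed with the trapezoidal rule. Within a cell the
    draw is placed by linear interpolation of the CDF, an approximation that
    improves as the grid is refined.
    """
    cdf = [0.0]
    for i in range(1, len(grid)):
        area = 0.5 * (density[i] + density[i - 1]) * (grid[i] - grid[i - 1])
        cdf.append(cdf[-1] + area)
    total = cdf[-1]
    cdf = [c / total for c in cdf]  # renormalize to absorb discretization error

    def sample(rng=random):
        u = rng.random()
        j = bisect.bisect_right(cdf, u)  # first index with cdf[j] > u
        j = min(max(j, 1), len(grid) - 1)
        lo, hi = cdf[j - 1], cdf[j]
        frac = 0.0 if hi == lo else (u - lo) / (hi - lo)
        return grid[j - 1] + frac * (grid[j] - grid[j - 1])

    return sample


# Stand-in unimodal density: triangular on [-0.4, 0.4] with mode at 0.
grid = [-0.4 + 0.01 * i for i in range(81)]
density = [max(0.0, 1.0 - abs(x) / 0.4) / 0.4 for x in grid]

sample = make_sampler(grid, density)
rng = random.Random(0)
draws = [sample(rng) for _ in range(20000)]
print(f"sample mean of draws = {sum(draws) / len(draws):.4f}")
```

The same sampler could be fed the grid values of any estimated density, so the quality of the
simulated inputs depends directly on the quality of the density estimate.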
CHAPTER 5:
CONCLUSION 





This thesis deals with the problem of estimating system performance in a fully descriptive
statistical manner when significant uncertainty is present; that is, we seek to quantify the
uncertainty about the performance in an accurate and rigorous way. We propose the application of
the epi-spline estimation framework for various uncertainty quantification (UQ) contexts. In
particular, we emphasize the flexibility of the epi-spline framework in its ability to
systematically incorporate any soft information available in a given problem context to further
enhance the accuracy of the estimate. We present results and analysis that support the potential
of the epi-spline framework in the area of UQ and nonparametric density estimation.


5.1 Key Findings 

In the analytic cases presented in Chapter 3, we present empirical results that support the basic
proposition that the epi-spline framework produces very reasonable estimates when compared
to known probability densities. We also present results of how the epi-spline estimates compare
to kernel estimates based on the same data sets. The epi-spline outperforms the kernel estimate
in most of the benchmark cases that we examine based on mean squared error (MSE) statistics.


We present epi-spline estimates from several analytic test cases where we can compare our
estimates with the true output densities. We also illustrate the impact of the sample size
on the estimates, which, along with the mathematical support provided in Chapter 2, indicates
that the estimates approach the true density as the sample size goes to infinity.


Fifty random replications with samples of five observations and no soft information show that
the epi-spline has an average MSE almost 48% lower than the kernel estimates with a 96%
reduction in standard deviation. A similar fifty replication test shows a reduction of over 23%
in average MSE with an almost 88% decrease in standard deviation. Continuing analysis reveals
that fifty random replications of the high information based epi-spline estimates further reduce
the average MSE over the no soft information epi-spline estimates by upwards of 65% with a
standard deviation reduction of another 56%.
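
The replication statistics quoted above can be illustrated with a small Python sketch. It
assumes that MSE means the average squared difference between an estimate and the true density
on an evaluation grid, uses a standard normal as a stand-in true density, and implements only a
Gaussian kernel estimator with a normal reference rule-of-thumb bandwidth. The function names
are hypothetical and the epi-spline estimator is not reproduced here; `percent_reduction` shows
how a figure such as "48% lower" would be computed from two average MSE values.

```python
import math
import random
import statistics


def kernel_estimate(sample, bandwidth):
    """Return a Gaussian kernel density estimate as a callable."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))

    def estimate(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)

    return estimate


def standard_normal(x):
    """Stand-in true density for the benchmark."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def mse_on_grid(estimate, true_density, grid):
    """Average squared difference between two densities on a common grid."""
    return sum((estimate(x) - true_density(x)) ** 2 for x in grid) / len(grid)


def percent_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


rng = random.Random(1)
grid = [-4.0 + 0.1 * i for i in range(81)]
mses = []
for _ in range(50):  # fifty replications with five observations each
    sample = [rng.gauss(0.0, 1.0) for _ in range(5)]
    bandwidth = 1.06 * statistics.stdev(sample) * len(sample) ** (-0.2)
    mses.append(mse_on_grid(kernel_estimate(sample, bandwidth), standard_normal, grid))

print(f"kernel: average MSE = {statistics.mean(mses):.5f}, "
      f"std. dev. = {statistics.stdev(mses):.5f}")
```

Running the same loop with a second estimator and passing the two averages to
`percent_reduction` gives the kind of comparison reported in this section.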


The structural engineering column example illustrates the ability of the epi-spline framework to
perform well under a more complex system function where we have multiple sources of uncertainty
in the input vector. We continue to see epi-spline estimates that significantly outperform
those produced by a standard kernel estimator.


Lastly, we explore the application of the epi-spline framework to a current real-world research
study context. We discuss the identification of sources of soft information and how they may
or may not be appropriate in certain circumstances. We also discuss a novel method to address
shortcomings in small data sets in order to produce estimates that more accurately represent the
qualitative understanding of the problem context.


5.2 Future Research 

There are several directions in which this thesis work can be expanded to complement the
broader research effort. One primary expansion area is to move the empirical work to
two-dimensional problem contexts where the system performance output ξ is no longer univariate,
but represents a random vector of output measures. In the multivariate case, the elements of ξ
may be correlated and this adds further complexity to the multidimensional estimation problem.


Another area of significant potential is the identification of additional sources of soft
information from various problem contexts, particularly those with defense-related applicability.
The identification of the sources of information as well as the constraint formulations that
facilitate the incorporation of the information into the estimation problem is of vital
importance. The unique strength of the epi-spline framework rests in its ability to leverage soft
information for improved estimates.


Further research on how best to leverage current methods of gradient estimation in stochastic
simulation would be particularly beneficial. Empirical results presented in this thesis make a
strong argument for the potential of gradient information to improve density estimates.
Simulation is an area of great potential for the application of the epi-spline framework in both
output analysis and input generation. Many of the current gradient estimation techniques being
researched are particularly applicable to the simulation context.


Lastly, there are significant opportunities to continue development and refinement of the
numerical implementation of the estimation framework and information constraints. The epi-spline
estimator is currently implemented in MATLAB, but work in other languages could add value to
the research effort and make the area more accessible to others in the community of interest.